Valuation Model

Setup and Context

Introduction

Welcome to Boston, Massachusetts in the 1970s! Imagine you're working for a real estate development company. Your company wants to value any residential project before construction starts. You are tasked with building a model that can provide a price estimate based on a home's characteristics like:

  • The number of rooms
  • The distance to employment centres
  • How rich or poor the area is
  • How many students there are per teacher in local schools, etc.

To accomplish your task you will:

  1. Analyse and explore the Boston house price data
  2. Split your data for training and testing
  3. Run a Multivariable Regression
  4. Evaluate your model's coefficients and residuals
  5. Use data transformation to improve your model's performance
  6. Use your model to estimate a property price

Import Statements

Code
import pandas as pd
import numpy as np

import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
from IPython.display import Markdown, display

from sklearn.linear_model import LinearRegression

Load the Data

The first column in the .csv file just has the row numbers, so it will be used as the index.

Code
data = pd.read_csv('data/boston.csv', index_col=0)

Understand the Boston House Price Dataset

Characteristics:

  • Number of Instances: 506
  • Number of Attributes: 13 numeric/categorical predictive. The Median Value (attribute 14) is the target.
  • Attribute Information (in order):
      1. CRIM: per capita crime rate by town
      2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
      3. INDUS: proportion of non-retail business acres per town
      4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
      5. NOX: nitric oxides concentration (parts per 10 million)
      6. RM: average number of rooms per dwelling
      7. AGE: proportion of owner-occupied units built prior to 1940
      8. DIS: weighted distances to five Boston employment centres
      9. RAD: index of accessibility to radial highways
      10. TAX: full-value property-tax rate per $10,000
      11. PTRATIO: pupil-teacher ratio by town
      12. B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
      13. LSTAT: % lower status of the population
      14. PRICE: Median value of owner-occupied homes in $1000's
  • Missing Attribute Values: None
  • Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of the UCI ML housing dataset, which was taken from the StatLib library maintained at Carnegie Mellon University. You can find the original research paper here.

Preliminary Data Exploration 🔎

Challenge

  • What is the shape of data?
  • How many rows and columns does it have?
  • What are the column names?
  • Are there any NaN values or duplicates?
Code
data.shape               # (rows, columns)
data.columns             # column names
data.isna().sum()        # count of NaN values per column
data.duplicated().sum()  # count of duplicated rows
0

Descriptive Statistics

Challenge

  • How many students are there per teacher on average?
Code
print(f'The average pupil-to-teacher ratio in the data is {data["PTRATIO"].mean():.2f}.')
The average pupil-to-teacher ratio in the data is 18.46.
  • What is the average price of a home in the dataset?
Code
print(f'The average price of a home in the dataset is ${1000 * data["PRICE"].mean():,.2f}.')
The average price of a home in the dataset is $22,532.81.
  • What is the CHAS feature?
Code
print(f'It is a dummy variable indicating whether a property is next to the Charles River or not. In the dataset {100 * data[data["CHAS"] == 1].shape[0] / data.shape[0]:.2}% of the properties were next to the Charles River.')
It is a dummy variable indicating whether a property is next to the Charles River or not. In the dataset 6.9% of the properties were next to the Charles River.
  • What is the maximum and the minimum number of rooms per dwelling in the dataset?
Code
print(f'The minimum number of rooms is {data["RM"].min():.1f}.')
print(f'The maximum number of rooms is {data["RM"].max():.1f}.')
The minimum number of rooms is 3.6.
The maximum number of rooms is 8.8.

Visualise the Features

Challenge: Having looked at some descriptive statistics, visualise the data for your model. Use Seaborn's .displot() to create a histogram and superimpose the Kernel Density Estimate (KDE) for the following variables:

  • PRICE: The home price in thousands.
Code
sns.displot(data, x = "PRICE", kde = True)
plt.show()

  • RM: the average number of rooms per dwelling.
Code
sns.displot(data, x = "RM", kde = True)
plt.show()

  • DIS: the weighted distance to the 5 Boston employment centres i.e., the estimated length of the commute.
Code
sns.displot(data, x = "DIS", kde = True)
plt.show()

  • RAD: the index of accessibility to highways.
Code
sns.displot(data, x = "RAD", kde = True)
plt.show()

Challenge

Create a bar chart with plotly for CHAS to show how many more homes are away from the river versus next to it.

You can make your life easier by providing a list of values for the x-axis (e.g., x=['No', 'Yes']).

Code
close_river = data.groupby("CHAS", as_index = False).size()
close_river["CHAS"] = close_river["CHAS"].astype(str)

fig = px.bar(close_river, x = "CHAS", y = "size", labels={"CHAS": "Close to the Charles River?", "size": "Number of properties"})
fig.update_xaxes(type = "category", labelalias = {"0.0": "Not close", "1.0": "Close"})
fig.show()
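An alternative that follows the hint more literally (a sketch; it assumes .value_counts() with .sort_index() returns the counts in 0/1 order):

Code
# Count homes away from (0) and next to (1) the river, in index order.
counts = data["CHAS"].value_counts().sort_index()
fig = px.bar(x = ["No", "Yes"], y = counts.values,
             labels = {"x": "Close to the Charles River?", "y": "Number of properties"})
fig.show()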

Understand the Relationships in the Data

Run a Pair Plot

Challenge

There might be some relationships in the data that we should know about. Before you run the code, make some predictions:

  • What would you expect the relationship to be between pollution (NOX) and the distance to employment (DIS)?
  • What kind of relationship do you expect between the number of rooms (RM) and the home value (PRICE)?
  • What about the amount of poverty in an area (LSTAT) and home prices?

Run a Seaborn .pairplot() to visualise all the relationships at the same time. Note: this is a big task and can take 1-2 minutes! After it's finished, check your intuition regarding the questions above on the pairplot.

Code
sns.pairplot(data)
plt.show()
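If the full grid takes too long, one option (an optional tweak, assuming a reasonably recent Seaborn version) is to draw only the lower triangle of the grid, which roughly halves the work:

Code
# corner=True skips the redundant upper-triangle panels.
sns.pairplot(data, corner=True)
plt.show()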

Challenge

Use Seaborn's .jointplot() to look at some of the relationships in more detail. Create a jointplot for:

  • DIS and NOX
Code
sns.jointplot(data, x = "DIS", y = "NOX", alpha = .3)
plt.show()

  • INDUS vs NOX
Code
sns.jointplot(data, x= "INDUS", y = "NOX", alpha = .3)
plt.show()

  • LSTAT vs RM
Code
sns.jointplot(data, x = "LSTAT", y = "RM", alpha = .3)
plt.show()

  • LSTAT vs PRICE
Code
sns.jointplot(data, x = "LSTAT", y = "PRICE", alpha = .3)
plt.show()

  • RM vs PRICE
Code
sns.jointplot(data, x = "RM", y = "PRICE", alpha = .3)
plt.show()

Try adding some opacity or alpha to the scatter plots using keyword arguments under joint_kws.
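For instance, the equivalent call using joint_kws (an alternative to passing alpha directly, as done above):

Code
# joint_kws forwards keyword arguments to the underlying scatter plot.
sns.jointplot(data, x = "RM", y = "PRICE", joint_kws = {"alpha": .3})
plt.show()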

Split Training & Test Dataset

We can't use all 506 entries in our dataset to train our model. The reason is that we want to evaluate our model on data that it hasn't seen yet (i.e., out-of-sample data). That way we can get a better idea of its performance in the real world.

Challenge

Code
from sklearn.model_selection import train_test_split
  • Create 4 subsets: X_train, X_test, y_train, y_test
  • Split the training and testing data roughly 80/20.
  • To get the same random split every time you run your notebook use random_state=10. This helps us get the same results every time and avoid confusion while we're learning.

Hint: Remember, your target is your home PRICE, and your features are all the other columns you'll use to predict the price.

Code
X_train, X_test, y_train, y_test = train_test_split(data.drop(columns = ["PRICE"]), data["PRICE"], test_size = .2, random_state = 10)
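A quick sanity check on the split sizes (roughly 80/20 of the 506 rows; with test_size=.2 scikit-learn rounds the test set up):

Code
# Expect 404 training rows and 102 test rows.
print(f"Training set: {len(X_train)} rows, test set: {len(X_test)} rows")
Training set: 404 rows, test set: 102 rows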

Multivariable Regression

In a previous lesson, we had a linear model with only a single feature (our movie budgets). This time we have a total of 13 features. Therefore, our Linear Regression model will have the following form:

\[ PRICE = \theta_0 + \theta_1 RM + \theta_2 NOX + \theta_3 DIS + \theta_4 CHAS + \ldots + \theta_{13} LSTAT\]

Run Your First Regression

Challenge

Use sklearn to run the regression on the training dataset. How high is the r-squared for the regression on the training data?

Code
reg = LinearRegression().fit(X_train, y_train)
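The r-squared can be read straight off the fitted model (the value matches the comparison table at the end of this section):

Code
print(f"r-squared on the training data: {reg.score(X_train, y_train):.2f}")
r-squared on the training data: 0.75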

Evaluate the Coefficients of the Model

Here we do a sense check on our regression coefficients. The first thing to look for is if the coefficients have the expected sign (positive or negative).

Challenge Print out the coefficients (the thetas in the equation above) for the features. Hint: You'll see a nice table if you stick the coefficients in a DataFrame.

Code
coefficients_train = pd.DataFrame(zip(X_train.columns, reg.coef_))
coefficients_train.columns = ["Feature", "Coefficient"]
display(Markdown(coefficients_train.to_markdown()))
Feature Coefficient
0 CRIM -0.128181
1 ZN 0.0631982
2 INDUS -0.00757628
3 CHAS 1.97451
4 NOX -16.272
5 RM 3.10846
6 AGE 0.0162922
7 DIS -1.48301
8 RAD 0.303988
9 TAX -0.0120821
10 PTRATIO -0.820306
11 B 0.011419
12 LSTAT -0.581626
  • We already saw that RM on its own had a positive relation to PRICE based on the scatter plot. Is RM's coefficient also positive?
Code
coef_rm = coefficients_train[coefficients_train['Feature'] == 'RM']['Coefficient'].iloc[0]
print(f"The coefficient of the number of rooms in the regression is {coef_rm:.2f}, which is {np.where(coef_rm < 0, 'negative', 'positive')}, and thus aligned with the expectation.")
The coefficient of the number of rooms in the regression is 3.11, which is positive, and thus aligned with the expectation.
  • What is the sign on the LSTAT coefficient? Does it match your intuition and the scatter plot above?
Code
coef_lstat = coefficients_train[coefficients_train['Feature'] == 'LSTAT']['Coefficient'].iloc[0]
print(f"The coefficient of the percentage of lower status to the total population in the regression is {coef_lstat:.2f}, which is {np.where(coef_lstat < 0, 'negative', 'positive')}, and thus aligned with the expectation.")
The coefficient of the percentage of lower status to the total population in the regression is -0.58, which is negative, and thus aligned with the expectation.
  • Based on the coefficients, how much more expensive is a home with 6 rooms compared to a home with 5 rooms? According to the model, what is the premium you would have to pay for an extra room?
Code
print(f"For each extra room, a property is ${1000 * coef_rm:,.2f} more expensive, all other factors kept constant.")
For each extra room, a property is $3,108.46 more expensive, all other factors kept constant.

Analyse the Estimated Values & Regression Residuals

The next step is to evaluate our regression. How good our regression is depends not only on the r-squared. It also depends on the residuals - the difference between the model's predictions (\(\hat{y_i}\)) and the true values (\(y_i\)) inside y_train.

Code
predicted_values = reg.predict(X_train)
residuals = (y_train - predicted_values)

Challenge: Create two scatter plots.

The first plot should be the actual values (y_train) against the predicted values:

Code
fig, ax = plt.subplots()
ax.scatter(y_train, predicted_values, color = "blue", alpha = .3)
ax.axline((0, 0), slope = 1, color = "cyan")
ax.set_ylabel("Predicted values")
ax.set_xlabel("Actual prices (training set)")
plt.show()

The cyan line in the middle shows y_train against y_train. If the predictions had been 100% accurate then all the dots would be on this line. The further away the dots are from the line, the worse the prediction was. That makes the distance to the cyan line, you guessed it, our residuals 😊

The second plot should be the residuals against the predicted prices:

Code
f, ax = plt.subplots()
ax.scatter(predicted_values, residuals, color = "purple", alpha = .3)
ax.set_xlabel("Predicted values")
ax.set_ylabel("Residuals")
plt.show()

Why do we want to look at the residuals? We want to check that they look random. Why? The residuals represent the errors of our model. If there's a pattern in our errors, then our model has a systematic bias.

We can analyse the distribution of the residuals. In particular, we're interested in the skew and the mean.

In an ideal case, what we want is something close to a normal distribution. A normal distribution has a skewness of 0 and a mean of 0. A skew of 0 means that the distribution is symmetrical, with a bell curve that is not lopsided or biased to one side.
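As a quick illustration (a hypothetical sketch, not part of the challenge), we can draw a large sample from a normal distribution and confirm that its mean and skew are both close to zero:

Code
from scipy.stats import skew
import numpy as np

# Draw 100,000 samples from a standard normal distribution.
rng = np.random.default_rng(seed=10)
sample = rng.normal(loc=0, scale=1, size=100_000)

# For a normal distribution, both values should print close to 0.
print(f"Sample mean: {sample.mean():.3f}, sample skew: {skew(sample):.3f}")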

Challenge

  • Calculate the mean and the skewness of the residuals.
Code
from scipy.stats import skew

print(f"The mean of the residuals is {residuals.mean():.2f}, and the skewness is {skew(residuals):.2f}")
The mean of the residuals is 0.00, and the skewness is 1.45
  • Again, use Seaborn's .displot() to create a histogram and superimpose the Kernel Density Estimate (KDE)
  • Is the skewness different from zero? If so, by how much?
  • Is the mean different from zero?
Code
sns.displot(residuals.to_frame(), x = "PRICE", kde = True)
plt.show()

Data Transformations for a Better Fit

We have two options at this point:

  1. Change our model entirely. Perhaps a linear model is not appropriate.
  2. Transform our data to make it fit better with our linear model.

Let's try a data transformation approach.

Challenge

Investigate if the target data['PRICE'] could be a suitable candidate for a log transformation.

  • Use Seaborn's .displot() to show a histogram and KDE of the price data.
  • Calculate the skew of that distribution.
Code
sns.displot(data, x = "PRICE", kde = True)
plt.show()

print(f"The skewness of the Price original data is {skew(data['PRICE']):.2f}.")

The skewness of the Price original data is 1.10.
  • Use NumPy's log() function to create a Series that has the log prices
  • Plot the log prices using Seaborn's .displot() and calculate the skew.
  • Which distribution has a skew that's closer to zero?
Code
data["log_price"] = np.log(data["PRICE"])
sns.displot(data, x = "log_price")
plt.show()
print(f"The skewness of the log-transformed price is {skew(data['log_price']):.2f}.")

The skewness of the log-transformed price is -0.33.

How does the log transformation work?

Using a log transformation does not affect every price equally: large prices are compressed more than smaller prices in the dataset.
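For a quick numeric illustration (a hypothetical sketch): equal ratios between prices become equal differences between log prices, so the gaps at the top of the scale shrink the most.

Code
# Each price doubles, but the gaps between log prices are a constant ln(2) ≈ 0.69.
prices = np.array([10, 20, 40, 80])
print(np.log(prices))
[2.30258509 2.99573227 3.68887945 4.38202663]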

We can see this when we plot the actual prices against the (transformed) log prices.

Code
f, ax = plt.subplots()
ax.scatter(data["PRICE"], data["log_price"], alpha = .3)
ax.set_ylabel("Log-transformed price")
ax.set_xlabel("Price")
plt.show()

Regression using Log Prices

Using log prices instead, our model has changed to:

\[ \log{PRICE} = \theta_0 + \theta_1 RM + \theta_2 NOX + \theta_3 DIS + \theta_4 CHAS + \ldots + \theta_{13} LSTAT \]

Challenge:

  • Use train_test_split() with the same random state as before to make the results comparable.
  • Run a second regression, but this time use the transformed target data.
Code
X_log_train, X_log_test, y_log_train, y_log_test = train_test_split(data.drop(columns = ["PRICE", "log_price"]), data["log_price"], test_size = .2, random_state = 10)
reg_log = LinearRegression().fit(X_log_train, y_log_train)
  • What is the r-squared of the regression on the training data?
  • Have we improved the fit of our model compared to before based on this measure?
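One way to answer (the value matches the comparison table at the end of this section):

Code
print(f"r-squared on the training data (log prices): {reg_log.score(X_log_train, y_log_train):.2f}")
r-squared on the training data (log prices): 0.79

Based on this measure, the fit has improved: 0.79 versus 0.75 for the regression on the original prices.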

Evaluating Coefficients with Log Prices

Challenge: Print out the coefficients of the new regression model.

Code
coefficients_log_train = pd.DataFrame(zip(X_log_train.columns, reg_log.coef_))
coefficients_log_train.columns = ["Feature", "Coefficient"]
display(Markdown(coefficients_log_train.to_markdown()))
Feature Coefficient
0 CRIM -0.0106717
1 ZN 0.00157929
2 INDUS 0.0020299
3 CHAS 0.0803305
4 NOX -0.704068
5 RM 0.0734044
6 AGE 0.000763302
7 DIS -0.0476333
8 RAD 0.0145651
9 TAX -0.000644998
10 PTRATIO -0.0347948
11 B 0.000515896
12 LSTAT -0.0313901
  • Do the coefficients still have the expected sign?
  • Is being next to the river a positive based on the data?
  • How does the quality of the schools affect property prices? What happens to prices as there are more students per teacher?

Hint: Use a DataFrame to make the output look pretty.

Regression with Log Prices & Residual Plots

Challenge:

  • Copy-paste the cell where you've created scatter plots of the actual versus the predicted home prices as well as the residuals versus the predicted values.
  • Add 2 more plots to the cell so that you can compare the regression outcomes with the log prices side by side.
  • Use indigo as the colour for the original regression and navy as the colour for the regression using log prices.
Code
predicted_log_values = reg_log.predict(X_log_train)
residuals_log = (y_log_train - predicted_log_values)

fig, ax = plt.subplots()
ax.scatter(y_log_train, predicted_log_values, color = "blue", alpha = .3)
ax.axline((1.5, 1.5), slope = 1, color = "cyan")
ax.set_ylabel("Predicted log values")
ax.set_xlabel("Actual log prices (training set)")
plt.show()

f, ax = plt.subplots()
ax.scatter(predicted_log_values, residuals_log, color = "purple", alpha = .3)
ax.set_xlabel("Predicted log values")
ax.set_ylabel("Residuals (log)")
plt.show()

print(f"The mean of the residuals_log is {residuals_log.mean():.2f}, and the skewness is {skew(residuals_log):.2f}")
sns.displot(residuals_log.to_frame(), x = "log_price", kde = True)
plt.show()

The mean of the residuals_log is -0.00, and the skewness is 0.09
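For a true side-by-side comparison, one option (a sketch using the colours from the challenge: indigo for the original model, navy for the log model) is to place all four scatter plots on a single figure:

Code
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# Top row: actual vs predicted for both models.
axes[0, 0].scatter(y_train, predicted_values, color = "indigo", alpha = .3)
axes[0, 0].axline((0, 0), slope = 1, color = "cyan")
axes[0, 0].set(xlabel = "Actual prices", ylabel = "Predicted values", title = "Original prices")

axes[0, 1].scatter(y_log_train, predicted_log_values, color = "navy", alpha = .3)
axes[0, 1].axline((1.5, 1.5), slope = 1, color = "cyan")
axes[0, 1].set(xlabel = "Actual log prices", ylabel = "Predicted log values", title = "Log prices")

# Bottom row: residuals vs predicted for both models.
axes[1, 0].scatter(predicted_values, residuals, color = "indigo", alpha = .3)
axes[1, 0].set(xlabel = "Predicted values", ylabel = "Residuals", title = "Original residuals")

axes[1, 1].scatter(predicted_log_values, residuals_log, color = "navy", alpha = .3)
axes[1, 1].set(xlabel = "Predicted log values", ylabel = "Residuals (log)", title = "Log residuals")

plt.tight_layout()
plt.show()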

Challenge:

Calculate the mean and the skew for the residuals using log prices. Are the mean and skew closer to 0 for the regression using log prices?

Compare Out of Sample Performance

The real test is how our model performs on data that it has not "seen" yet. This is where our X_test comes in.

Challenge

Compare the r-squared of the two models on the test dataset. Which model does better? Is the r-squared higher or lower than for the training dataset? Why?

Code
from sklearn.metrics import r2_score

predicted_test_normal = reg.predict(X_test)
predicted_test_log = reg_log.predict(X_log_test)

r2_training_normal = r2_score(y_train, predicted_values)
r2_training_log = r2_score(y_log_train, predicted_log_values)
r2_test_normal = r2_score(y_test, predicted_test_normal)
r2_test_log = r2_score(y_log_test, predicted_test_log)

df_r2 = pd.DataFrame({"Type": ["Training set, normal", "Training set, log", "Test set, normal", "Test set, log"], "R2": [r2_training_normal, r2_training_log, r2_test_normal, r2_test_log]})

display(Markdown(df_r2.to_markdown(index = False)))
Type R2
Training set, normal 0.750122
Training set, log 0.793023
Test set, normal 0.670934
Test set, log 0.744692
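Equivalently, the fitted models report r-squared directly via .score():

Code
# .score() returns the coefficient of determination for the given data.
print(reg.score(X_test, y_test))              # same as r2_test_normal
print(reg_log.score(X_log_test, y_log_test))  # same as r2_test_log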

Predict a Property's Value using the Regression Coefficients

Our preferred model now has an equation that looks like this:

\[ \log{PRICE} = \theta_0 + \theta_1 RM + \theta_2 NOX + \theta_3 DIS + \theta_4 CHAS + \ldots + \theta_{13} LSTAT \]

The average property has the mean value for all of its characteristics:

Starting Point: Average Values in the Dataset

Code
features = data.drop(['PRICE', "log_price"], axis=1)
average_vals = features.mean().values
property_stats = pd.DataFrame(data=average_vals.reshape(1, len(features.columns)),
                              columns=features.columns)
display(Markdown(property_stats.T.to_markdown()))
0
CRIM 3.61352
ZN 11.3636
INDUS 11.1368
CHAS 0.06917
NOX 0.554695
RM 6.28463
AGE 68.5749
DIS 3.79504
RAD 9.54941
TAX 408.237
PTRATIO 18.4555
B 356.674
LSTAT 12.6531

Challenge

Predict how much the average property is worth using the stats above. What is the log price estimate and what is the dollar estimate? You'll have to reverse the log transformation with np.exp() to find the dollar value.
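One possible solution (a sketch; the printed estimates are left for you to verify against the average PRICE computed earlier):

Code
# Predict the log price for the average property, then undo the log transform.
average_log_estimate = reg_log.predict(property_stats)[0]
dollar_estimate = 1000 * np.exp(average_log_estimate)
print(f"The log price estimate is {average_log_estimate:.2f}.")
print(f"The dollar estimate is ${dollar_estimate:,.2f}.")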

Challenge

Keeping the average values for CRIM, RAD, INDUS and others, value a property with the following characteristics:

Define Property Characteristics

  • next_to_river = True
  • nr_rooms = 8
  • students_per_classroom = 20
  • distance_to_town = 5
  • pollution = data.NOX.quantile(q=0.75) # high
  • amount_of_poverty = data.LSTAT.quantile(q=0.25) # low
Code
x_particular = property_stats.copy()
x_particular["CHAS"] = 1
x_particular["RM"] = 8
x_particular["PTRATIO"] = 20
x_particular["DIS"] = 5
x_particular["NOX"] = data["NOX"].quantile(.75) # high
x_particular["LSTAT"] = data["LSTAT"].quantile(.25) # low

y_particular = reg_log.predict(x_particular)

print(f"The property has a predicted value of ${np.exp(y_particular)[0] * 1000:,.2f}.")
The property has a predicted value of $25,792.03.